Abstract

We study the relation between support vector machines (SVMs) for regression (SVMR)
and SVMs for classification (SVMC). We show that for a given SVMC solution there exists
an SVMR solution that is equivalent for a certain choice of the parameters. In particular,
our result is that for $\epsilon$ sufficiently close to one, the optimal hyperplane and threshold for
the SVMC problem with regularization parameter $C_c$ are equal to $\frac{1}{1-\epsilon}$ times the optimal
hyperplane and threshold for SVMR with regularization parameter $C_r = (1 - \epsilon) C_c$. A direct
consequence of this result is that SVMC can be seen as a special case of SVMR.
1 Introduction

We assume that the reader has some familiarity with Support Vector Machines. In this section, 
we provide a brief review, aimed specifically at introducing the formulations and notations we 
will use throughout this paper. For a good introduction to SVMs, see [1] or [2]. 
In the support vector machine classification problem, we are given $l$ examples $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l)$,
with $\mathbf{x}_i \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$ for all $i$. The goal is to find a hyperplane and threshold $(\mathbf{w}, b)$
that separates the positive and negative examples with maximum margin, penalizing points inside
the margin linearly in a user-selected regularization parameter $C > 0$. The SVM classification
problem can be restated as finding an optimal solution to the following quadratic programming
problem:

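$$
(\mathcal{C}) \qquad
\begin{aligned}
\min_{\mathbf{w}, b, \xi} \quad & \tfrac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{l} \xi_i \\
\text{subject to} \quad & y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, l \\
& \xi \ge 0
\end{aligned}
$$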
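As a small illustration of the classification formulation, the sketch below evaluates its objective at a candidate hyperplane and threshold. It assumes each slack variable is set to its smallest feasible value, $\max(0, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b))$; the function name `svmc_objective` and the toy data are hypothetical and are not taken from the paper.

```python
def svmc_objective(w, b, xs, ys, C):
    """Objective of the classification QP at (w, b), with minimal feasible slacks."""
    half_norm_sq = 0.5 * sum(wk * wk for wk in w)
    slack_sum = 0.0
    for x, y in zip(xs, ys):
        margin = y * (sum(wk * xk for wk, xk in zip(w, x)) + b)
        # Smallest slack that satisfies y (w . x + b) >= 1 - slack, slack >= 0.
        slack_sum += max(0.0, 1.0 - margin)
    return half_norm_sq + C * slack_sum


# Hypothetical one-dimensional toy data (not the data used in the paper).
xs = [(-2.0,), (-1.0,), (1.0,), (2.0,)]
ys = [-1, -1, 1, 1]

value_on_margin = svmc_objective((1.0,), 0.0, xs, ys, C=5.0)
value_inside_margin = svmc_objective((0.5,), 0.0, xs, ys, C=5.0)
```

With $\mathbf{w} = 1$ every point satisfies its margin constraint and only the norm term contributes; with $\mathbf{w} = 0.5$ the two inner points fall inside the margin and are penalized linearly.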

This formulation is motivated by the fact that minimizing the norm of w is equivalent to maxi- 
mizing the margin; the goal of maximizing the margin is in turn motivated by attempts to bound 
the generalization error via structural risk minimization. This theme is developed in [2]. 
In the support vector machine regression problem, the goal is to construct a hyperplane that lies
"close" to as many of the data points as possible. We are given $l$ examples $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l)$,
with $\mathbf{x}_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$ for all $i$. Again, we must select a hyperplane and threshold $(\mathbf{w}, b)$.
Our objective is to choose a hyperplane $\mathbf{w}$ with small norm, while simultaneously minimizing the
sum of the distances from our points to the hyperplane, measured using Vapnik's $\epsilon$-insensitive
loss function:

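$$
|y_i - (\mathbf{w} \cdot \mathbf{x}_i + b)|_\epsilon =
\begin{cases}
0 & \text{if } |y_i - (\mathbf{w} \cdot \mathbf{x}_i + b)| \le \epsilon \\
|y_i - (\mathbf{w} \cdot \mathbf{x}_i + b)| - \epsilon & \text{otherwise}
\end{cases}
\qquad (1)
$$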
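The loss in equation (1) can be written as a short function. This is a minimal sketch; the name `eps_insensitive_loss` is an illustrative choice, and `prediction` stands for the value $\mathbf{w} \cdot \mathbf{x}_i + b$.

```python
def eps_insensitive_loss(y, prediction, eps):
    """Vapnik's epsilon-insensitive loss: zero inside the tube, linear outside it."""
    residual = abs(y - prediction)
    return 0.0 if residual <= eps else residual - eps


inside_tube = eps_insensitive_loss(1.0, 0.5, eps=0.6)    # residual 0.5 is within the tube
outside_tube = eps_insensitive_loss(1.0, -0.5, eps=0.5)  # residual 1.5 exceeds the tube
```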

The parameter $\epsilon$ is preselected by the user. As in the classification case, the tradeoff between
finding a hyperplane with small norm and finding a hyperplane that performs regression well is
controlled via a user-selected regularization parameter $C$. The quadratic programming problem
associated with SVMR is:

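$$
(\mathcal{R}) \qquad
\begin{aligned}
\min_{\mathbf{w}, b, \xi, \xi^*} \quad & \tfrac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{l} (\xi_i + \xi_i^*) \\
\text{subject to} \quad & y_i - (\mathbf{w} \cdot \mathbf{x}_i + b) \le \epsilon + \xi_i, \quad i = 1, \ldots, l \\
& -y_i + (\mathbf{w} \cdot \mathbf{x}_i + b) \le \epsilon + \xi_i^*, \quad i = 1, \ldots, l \\
& \xi, \xi^* \ge 0
\end{aligned}
$$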
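To make the regression formulation concrete, the following sketch evaluates its objective at a given $(\mathbf{w}, b)$, again assuming the slacks take their smallest feasible values. The function name `svmr_objective` and the toy data are hypothetical.

```python
def svmr_objective(w, b, xs, ys, C, eps):
    """Objective of the regression QP at (w, b), with minimal feasible slacks."""
    half_norm_sq = 0.5 * sum(wk * wk for wk in w)
    slack_sum = 0.0
    for x, y in zip(xs, ys):
        f = sum(wk * xk for wk, xk in zip(w, x)) + b
        xi = max(0.0, y - f - eps)       # y lies above the eps-tube around f
        xi_star = max(0.0, f - y - eps)  # y lies below the eps-tube around f
        slack_sum += xi + xi_star
    return half_norm_sq + C * slack_sum


# Hypothetical one-dimensional toy data with +/-1 targets.
xs = [(-2.0,), (-1.0,), (1.0,), (2.0,)]
ys = [-1.0, -1.0, 1.0, 1.0]

value = svmr_objective((0.5,), 0.0, xs, ys, C=2.0, eps=0.25)
```

Here the two inner points each leave the tube by 0.25, so the penalty term contributes $2 \times 0.5$ on top of the norm term $0.125$.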

The main aim of this paper is to demonstrate a connection between support vector machine 
classification and regression. 

In general, SVM classification and regression are performed using a nonlinear kernel $K(\mathbf{x}_i, \mathbf{x}_j)$.
For simplicity of notation, we chose to present our formulations in terms of a linear separating 
hyperplane w. All our results apply to the nonlinear case; the reader may assume that we are 
trying to construct a linear separating hyperplane in a high-dimensional feature space. 



1 Observe that now the hyperplane will reside in n + 1 dimensions. 



2 From Regression to Classification 

In the support vector machine regression problem, the $y_i$ are real-valued rather than
binary-valued. However, there is no prohibition against the $y_i$ being binary-valued. In particular, if
$y_i \in \{-1, 1\}$ for all $i$, then we may perform support vector machine classification or regression
on the same data set.

Note that when performing support vector machine regression on $\{-1, 1\}$-valued data, if $\epsilon \ge 1$, then
$\mathbf{w} = 0$, $\xi = 0$, $\xi^* = 0$ is an optimal solution to $\mathcal{R}$. Therefore, we restrict our attention to
cases where $\epsilon < 1$. Loosely stated, our main result is that for $\epsilon$ sufficiently close to one, the
optimal hyperplane and threshold for the support vector machine classification problem with
regularization parameter $C_c$ are equal to $\frac{1}{1-\epsilon}$ times the optimal hyperplane and threshold for the
support vector machine regression problem with regularization parameter $C_r = (1 - \epsilon) C_c$. We
now proceed to formally derive this result.
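We make the following variable substitution:

$$
\eta_i = \begin{cases} \xi_i & \text{if } y_i = 1 \\ \xi_i^* & \text{if } y_i = -1 \end{cases}
\qquad
\eta_i^* = \begin{cases} \xi_i^* & \text{if } y_i = 1 \\ \xi_i & \text{if } y_i = -1 \end{cases}
$$
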
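Combining this substitution with our knowledge that $y_i \in \{-1, 1\}$ yields the following
modification of $\mathcal{R}$:

$$
(\mathcal{R}') \qquad
\begin{aligned}
\min_{\mathbf{w}, b, \eta, \eta^*} \quad & \tfrac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{l} (\eta_i + \eta_i^*) \\
\text{subject to} \quad & y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \epsilon - \eta_i, \quad i = 1, \ldots, l \\
& y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \le 1 + \epsilon + \eta_i^*, \quad i = 1, \ldots, l \\
& \eta, \eta^* \ge 0
\end{aligned}
$$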
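The relabeling of slack variables can be checked numerically. The sketch below assumes the substitution sends $(\xi_i, \xi_i^*)$ to $(\eta_i, \eta_i^*)$ when $y_i = 1$ and swaps the pair when $y_i = -1$, which is the reading consistent with the two constraint sets; all function names are hypothetical, and `f` stands for $\mathbf{w} \cdot \mathbf{x}_i + b$.

```python
def tube_slacks(y, f, eps):
    """Minimal slacks (xi, xi_star) of the regression problem for target y, prediction f."""
    return max(0.0, y - f - eps), max(0.0, f - y - eps)


def margin_slacks(y, f, eps):
    """Minimal slacks (eta, eta_star) of the margin form, for a label y in {-1, +1}."""
    return max(0.0, 1.0 - eps - y * f), max(0.0, y * f - 1.0 - eps)


def substitute(y, xi, xi_star):
    """Assumed substitution: keep the pair when y = +1, swap it when y = -1."""
    return (xi, xi_star) if y == 1 else (xi_star, xi)


eps = 0.5
checks = []
for y in (-1, 1):
    for f in (-2.0, -1.0, -0.25, 0.0, 0.25, 1.0, 2.0):
        xi, xi_star = tube_slacks(y, f, eps)
        eta, eta_star = margin_slacks(y, f, eps)
        sub_eta, sub_eta_star = substitute(y, xi, xi_star)
        checks.append(abs(sub_eta - eta) < 1e-12 and abs(sub_eta_star - eta_star) < 1e-12)

all_match = all(checks)
```

For every label and prediction tried, the substituted tube slacks coincide with the slacks of the margin form, which is what allows the two constraints per point to be rewritten in terms of $y_i(\mathbf{w} \cdot \mathbf{x}_i + b)$.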

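Continuing, we divide both sides of each constraint by $1 - \epsilon$, and make the variable substitutions
$\mathbf{w}' = \frac{\mathbf{w}}{1-\epsilon}$, $b' = \frac{b}{1-\epsilon}$, $\eta' = \frac{\eta}{1-\epsilon}$, $\eta^{*\prime} = \frac{\eta^*}{1-\epsilon}$. Dividing the objective by the constant $(1 - \epsilon)^2$, we obtain:

$$
(\mathcal{R}'') \qquad
\begin{aligned}
\min_{\mathbf{w}', b', \eta', \eta^{*\prime}} \quad & \tfrac{1}{2} \|\mathbf{w}'\|^2 + \frac{C}{1-\epsilon} \sum_{i=1}^{l} (\eta_i' + \eta_i^{*\prime}) \\
\text{subject to} \quad & y_i (\mathbf{w}' \cdot \mathbf{x}_i + b') \ge 1 - \eta_i', \quad i = 1, \ldots, l \\
& y_i (\mathbf{w}' \cdot \mathbf{x}_i + b') \le \frac{1+\epsilon}{1-\epsilon} + \eta_i^{*\prime}, \quad i = 1, \ldots, l \\
& \eta', \eta^{*\prime} \ge 0
\end{aligned}
$$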
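The rescaling step can be sanity-checked numerically: at any $(\mathbf{w}, b)$, the objective before rescaling should equal $(1 - \epsilon)^2$ times the rescaled objective at $(\mathbf{w}, b) / (1 - \epsilon)$. The sketch below assumes minimal feasible slacks in both forms; the function names and the data are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unscaled_objective(w, b, xs, ys, C, eps):
    """Margin form before rescaling, with coefficient C on the slacks."""
    total = 0.0
    for x, y in zip(xs, ys):
        yf = y * (dot(w, x) + b)
        total += max(0.0, 1.0 - eps - yf) + max(0.0, yf - 1.0 - eps)
    return 0.5 * dot(w, w) + C * total


def rescaled_objective(w_s, b_s, xs, ys, C, eps):
    """Rescaled form, with coefficient C / (1 - eps) on the slacks."""
    bound = (1.0 + eps) / (1.0 - eps)
    total = 0.0
    for x, y in zip(xs, ys):
        yf = y * (dot(w_s, x) + b_s)
        total += max(0.0, 1.0 - yf) + max(0.0, yf - bound)
    return 0.5 * dot(w_s, w_s) + C / (1.0 - eps) * total


# Hypothetical data and an arbitrary (not optimal) candidate solution.
xs = [(-2.0,), (-0.5,), (0.5,), (9.0,)]
ys = [-1, 1, 1, 1]
w, b, C, eps = (0.4,), 0.1, 2.0, 0.5

lhs = unscaled_objective(w, b, xs, ys, C, eps)
scale = 1.0 - eps
rhs = scale ** 2 * rescaled_objective(tuple(wk / scale for wk in w), b / scale, xs, ys, C, eps)
```

Because the two objectives differ only by the positive constant $(1 - \epsilon)^2$, they share the same minimizers up to the change of variables.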

Looking at formulation $\mathcal{R}''$, one suspects that as $\epsilon$ grows close to 1, the second set of
constraints will be "automatically" satisfied with $\eta^{*\prime} = 0$. We confirm this suspicion by forming the
Lagrangian dual:

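$$
(\mathcal{RD}'') \qquad
\begin{aligned}
\min_{\beta, \beta^*} \quad & \tfrac{1}{2} \sum_{i,j=1}^{l} (\beta_i - \beta_i^*) D_{ij} (\beta_j - \beta_j^*) - \sum_{i} \beta_i + \frac{1+\epsilon}{1-\epsilon} \sum_{i} \beta_i^* \\
\text{subject to} \quad & \sum_{i} y_i (\beta_i - \beta_i^*) = 0 \\
& \beta, \beta^* \ge 0 \\
& \beta, \beta^* \le \frac{C}{1-\epsilon}
\end{aligned}
$$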
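For readers who want the intermediate step, the following sketch shows how a dual of this form arises from the rescaled primal. The multipliers $\mu_i$ and $\mu_i^*$ for the nonnegativity constraints are names introduced here for illustration; they do not appear in the original text.

```latex
\begin{aligned}
L ={}& \tfrac{1}{2}\|\mathbf{w}'\|^2 + \frac{C}{1-\epsilon}\sum_i (\eta_i' + \eta_i^{*\prime})
      - \sum_i \beta_i \bigl[ y_i(\mathbf{w}'\cdot\mathbf{x}_i + b') - 1 + \eta_i' \bigr] \\
     & - \sum_i \beta_i^* \Bigl[ \tfrac{1+\epsilon}{1-\epsilon} + \eta_i^{*\prime} - y_i(\mathbf{w}'\cdot\mathbf{x}_i + b') \Bigr]
      - \sum_i \mu_i \eta_i' - \sum_i \mu_i^* \eta_i^{*\prime} \\[4pt]
\frac{\partial L}{\partial \mathbf{w}'} = 0 \;&\Rightarrow\; \mathbf{w}' = \sum_i (\beta_i - \beta_i^*)\, y_i \mathbf{x}_i \\
\frac{\partial L}{\partial b'} = 0 \;&\Rightarrow\; \sum_i y_i (\beta_i - \beta_i^*) = 0 \\
\frac{\partial L}{\partial \eta_i'} = 0 \;&\Rightarrow\; \beta_i = \frac{C}{1-\epsilon} - \mu_i \le \frac{C}{1-\epsilon} \\
\frac{\partial L}{\partial \eta_i^{*\prime}} = 0 \;&\Rightarrow\; \beta_i^* = \frac{C}{1-\epsilon} - \mu_i^* \le \frac{C}{1-\epsilon}
\end{aligned}
```

Substituting the expression for $\mathbf{w}'$ back into $L$ and changing the sign to turn the maximization into a minimization gives a quadratic form in $\beta - \beta^*$, minus $\sum_i \beta_i$, plus $\frac{1+\epsilon}{1-\epsilon} \sum_i \beta_i^*$.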

where $D$ is the symmetric positive semidefinite matrix defined by the equation $D_{ij} = y_i y_j \mathbf{x}_i \cdot \mathbf{x}_j$.
For all $\epsilon$ sufficiently close to one, the $\eta_i^{*\prime}$ will all be zero. To see this, note that $\beta = 0$, $\beta^* = 0$ is a
feasible solution to $\mathcal{RD}''$ with cost zero, and if any $\beta_i^*$ is positive, then for $\epsilon$ sufficiently close to one
the value of the solution will be positive. Hence $\beta^* = 0$ at the optimum, and complementary slackness
then forces $\eta^{*\prime} = 0$. Therefore, assuming that $\epsilon$ is sufficiently large, we may
eliminate the $\eta^{*\prime}$ terms from $\mathcal{R}''$ and the $\beta^*$ terms from $\mathcal{RD}''$. But removing these terms from
$\mathcal{RD}''$ leaves us with a quadratic program essentially identical to the dual of formulation $\mathcal{C}$:

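$$
(\mathcal{CD}'') \qquad
\begin{aligned}
\min_{\beta} \quad & \tfrac{1}{2} \sum_{i,j=1}^{l} \beta_i D_{ij} \beta_j - \sum_{i} \beta_i \\
\text{subject to} \quad & \sum_{i} y_i \beta_i = 0 \\
& \beta \ge 0 \\
& \beta \le \frac{C}{1-\epsilon}
\end{aligned}
$$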

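Going back through the dual, we recover a slightly modified version of $\mathcal{C}$:

$$
(\mathcal{C}') \qquad
\begin{aligned}
\min_{\mathbf{w}, b, \xi} \quad & \tfrac{1}{2} \|\mathbf{w}\|^2 + \frac{C}{1-\epsilon} \sum_{i=1}^{l} \xi_i \\
\text{subject to} \quad & y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, l \\
& \xi \ge 0
\end{aligned}
$$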

Starting from the classification problem instead of the regression problem, we have proved the 
following theorem: 

Theorem 2.1 Suppose the classification problem $\mathcal{C}$ is solved with regularization parameter $C$,
and the optimal solution is found to be $(\mathbf{w}, b)$. Then there exists a value $a \in (0, 1)$ such that
for all $\epsilon \in [a, 1)$, if problem $\mathcal{R}$ is solved with regularization parameter $(1 - \epsilon) C$, the optimal solution
will be $(1 - \epsilon)(\mathbf{w}, b)$.
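The theorem can be illustrated on a tiny one-dimensional problem. The sketch below is only a rough check under loud assumptions: it replaces a quadratic programming solver with an exhaustive search over a coarse grid of $(w, b)$ values, the data are hypothetical, and the threshold is taken to be $a = (m - 1)/(m + 1)$ with $m$ the largest value of $y_i(w x_i + b)$ at the classification solution.

```python
def c_objective(w, b, xs, ys, C):
    """Classification objective in one dimension, with minimal slacks."""
    return 0.5 * w * w + C * sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in zip(xs, ys))


def r_objective(w, b, xs, ys, C, eps):
    """Regression objective in one dimension, using the eps-insensitive loss."""
    return 0.5 * w * w + C * sum(max(0.0, abs(y - (w * x + b)) - eps) for x, y in zip(xs, ys))


# Hypothetical separable data (not the data behind the paper's figures).
xs = [-3.0, -1.0, 1.0, 2.0]
ys = [-1, -1, 1, 1]
C_c = 5.0

# Grid search stands in for a QP solver; it only finds the best grid point.
grid = [(i / 20.0, j / 20.0) for i in range(0, 61) for j in range(-40, 41)]
w_c, b_c = min(grid, key=lambda p: c_objective(p[0], p[1], xs, ys, C_c))

m = max(y * (w_c * x + b_c) for x, y in zip(xs, ys))
a = (m - 1.0) / (m + 1.0)
eps = (a + 1.0) / 2.0  # any value in [a, 1) should work

# Search the regression problem on the same grid scaled by (1 - eps).
scaled_grid = [((1.0 - eps) * w, (1.0 - eps) * b) for w, b in grid]
C_r = (1.0 - eps) * C_c
w_r, b_r = min(scaled_grid, key=lambda p: r_objective(p[0], p[1], xs, ys, C_r, eps))
```

On this data the best grid point for regression is the best grid point for classification scaled by $1 - \epsilon$, as the theorem predicts. A grid search cannot prove optimality, so this is an illustration rather than a verification.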

Several points regarding this theorem are in order: 

• The $\eta$ substitution. This substitution has an intuitive interpretation. In formulation $\mathcal{R}$,
a variable $\xi_i$ is non-zero if and only if $y_i$ lies above the $\epsilon$-tube, and the corresponding $\xi_i^*$ is
non-zero if and only if $y_i$ lies below the $\epsilon$-tube. This is independent of whether $y_i$ is $1$ or $-1$.
After the $\eta$ substitution, $\eta_i$ is non-zero if $y_i = 1$ and $y_i$ lies above the $\epsilon$-tube, or if $y_i = -1$
and $y_i$ lies below the $\epsilon$-tube. A similar interpretation holds for the $\eta_i^*$. Intuitively, the $\eta_i$
correspond to error points which lie on the same side of the tube as their sign, and the $\eta_i^*$
correspond to error points which lie on the opposite side. We might guess that as $\epsilon$ goes to
one, only the former type of error will remain: the theorem provides a constructive proof
of this conjecture.

• Support Vectors. Examination of the formulations, and their KKT conditions, shows
that there is a one-to-one correspondence between support vectors of $\mathcal{C}$ and support vectors
of $\mathcal{R}$ under the conditions of correspondence. Points which are not support vectors in $\mathcal{C}$, and
therefore lie outside the margin and are correctly classified, will lie strictly inside the $\epsilon$-tube
in $\mathcal{R}$. Points which lie on the margin in $\mathcal{C}$ will lie on the boundaries of the $\epsilon$-tube in $\mathcal{R}$, and
are support vectors for both problems. Finally, points which lie inside the margin or are
incorrectly classified in $\mathcal{C}$ will lie strictly outside the $\epsilon$-tube, above the tube for points with
$y = 1$, below the tube for points with $y = -1$, and are support vectors for both problems.

• Computation of $a$. Using the KKT conditions associated with problem $\mathcal{R}''$, we can
determine the value of $a$ which satisfies the theorem. To do so, simply solve problem $\mathcal{C}$, and
choose $a$ to be the smallest value such that when the constraints $y_i (\mathbf{w}' \cdot \mathbf{x}_i + b') \le \frac{1+\epsilon}{1-\epsilon} + \eta_i^{*\prime}$, $i = 1, \ldots, l$,
are added, they are satisfied by the optimal solution to $\mathcal{C}$. In particular, if we define
$m$ to be the maximal value of $y_i (\mathbf{w} \cdot \mathbf{x}_i + b)$, then $a = \frac{m-1}{m+1}$ will satisfy the theorem. Observe
that as $w := \|\mathbf{w}\|$ gets larger (i.e., the separating hyperplane gets steeper), or as the
correctly classified $\mathbf{x}_i$ get relatively (in units of the margin $w^{-1}$) farther away from the
hyperplane, we expect $a$ to increase. More precisely, it is easy to see that $m \le wD$, with $D$
the diameter of the smallest hypersphere containing all the points. Since $\frac{m-1}{m+1}$ is increasing in $m$, it follows that $a \le \frac{wD-1}{wD+1}$, which
is an increasing function of $w$. Finally, observe that incorrectly classified points will have
$y_i (\mathbf{w} \cdot \mathbf{x}_i + b) < 0$, and therefore they cannot affect $m$ or $a$.

• The $\xi^2$ case. We may perform a similar analysis when the slack variables are penalized
quadratically rather than linearly. The analysis proceeds nearly identically. In the
transition from formulation $\mathcal{R}'$ to $\mathcal{R}''$, an extra factor of $(1 - \epsilon)$ falls out, so the objective
function in $\mathcal{R}''$ is simply $\tfrac{1}{2} \|\mathbf{w}'\|^2 + C \sum_{i=1}^{l} \left( (\eta_i')^2 + (\eta_i^{*\prime})^2 \right)$. The theorem then states that
for sufficiently large $\epsilon$, if $(\mathbf{w}, b)$ solves $\mathcal{C}$ with regularization parameter $C$, then $(1 - \epsilon)(\mathbf{w}, b)$
solves $\mathcal{R}$, also with regularization parameter $C$.
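The computation of the threshold described in the points above can be sketched in a few lines. The code assumes $(\mathbf{w}, b)$ is already the optimal classification solution (here it is simply supplied by hand for hypothetical data) and uses the formula $(m - 1)/(m + 1)$ with $m$ the largest value of $y_i(\mathbf{w} \cdot \mathbf{x}_i + b)$; the function names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def threshold_a(w, b, xs, ys):
    """Return (m - 1) / (m + 1), where m is the largest value of y_i (w . x_i + b)."""
    m = max(y * (dot(w, x) + b) for x, y in zip(xs, ys))
    return (m - 1.0) / (m + 1.0)


def upper_constraints_hold(w, b, xs, ys, eps):
    """Check y_i (w . x_i + b) <= (1 + eps) / (1 - eps) for all i, i.e. with zero slack."""
    bound = (1.0 + eps) / (1.0 - eps)
    return all(y * (dot(w, x) + b) <= bound + 1e-12 for x, y in zip(xs, ys))


# Hypothetical data and a hand-supplied classification solution.
xs = [(-3.0,), (-1.0,), (1.0,), (2.0,)]
ys = [-1, -1, 1, 1]
w, b = (1.0,), 0.0

a = threshold_a(w, b, xs, ys)
holds_at_a = upper_constraints_hold(w, b, xs, ys, a)
holds_above_a = upper_constraints_hold(w, b, xs, ys, 0.9)
holds_below_a = upper_constraints_hold(w, b, xs, ys, 0.4)
```

Here the farthest point gives $m = 3$ and a threshold of $0.5$; the added constraints hold with zero slack for every $\epsilon$ at or above that value and fail just below it.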

3 Examples 

In this section, we present two simple one-dimensional examples that help to illustrate the
theorem. These examples were both performed penalizing the $\xi_i$ linearly.

In the first example, the data are linearly separable. Figure 1a shows the data points, and
Figure 1b shows the separating hyperplane found by performing support vector classification
with $C = 5$ on this data set. Note that in the classification problem, the data lie in one
dimension, with the $y$-values being "labels". The hyperplane drawn shows the value of $\mathbf{w} \cdot \mathbf{x} + b$
as a function of $x$. The computed value of $a$ is approximately .63. Figure 1c shows the $\epsilon$-tube
computed for the regression problem with $\epsilon = .65$, and Figure 1d shows the same for $\epsilon = .9$.
Note that every data point is correctly classified in the classification problem, and that every
data point lies inside the $\epsilon$-tube in the regression problems, for the values of $\epsilon$ chosen.
In the second example, the data are not linearly separable. Figure 2a shows the data points.
Figure 2b shows the separating hyperplane found by performing classification with $C = 5$. The
computed value of $a$ is approximately .08. Figures 2c and 2d show the regression tubes for $\epsilon = .1$
and $\epsilon = .5$, respectively. Note that the points that lie at the edge of the margin for classification,
$x = -5$ and $x = 6$, lie on the edge of the $\epsilon$-tube in the regression problems, and that points that
lie inside the margin, or are misclassified, lie outside the $\epsilon$-tube. The point $x = -6$, which is
the only point that is strictly outside the margin in the classification problem, lies inside the
$\epsilon$-tubes. The image provides insight as to why $a$ is much smaller in this problem than in the
linearly separable example: in the linearly separable case, any $\epsilon$-tube must be shallow and wide
enough to contain all the points.

4 Conclusions and Future Work 

In this note we have shown how SVMR can be related to SVMC. Our main result can be
summarized as follows: if $\epsilon$ is sufficiently close to one, the optimal hyperplane and threshold
for the SVMC problem with regularization parameter $C_c$ are equal to $\frac{1}{1-\epsilon}$ times the optimal
hyperplane and threshold for SVMR with regularization parameter $C_r = (1 - \epsilon) C_c$. A direct
consequence of this result is that SVMC can be regarded as a special case of SVMR. An important
problem, which will be the subject of future work, is whether this result can help place SVMC and SVMR
in a common framework of structural risk minimization.

Figure 1: Separable data. Panels: (c) Regression, $\epsilon = .65$; (d) Regression, $\epsilon = .9$.

Figure 2: Non-separable data. Panels: (c) Regression, $\epsilon = .1$; (d) Regression, $\epsilon = .5$.



